Abstract


A command and control (C2) system depends crucially on having high-quality underlying data.
There is still no “best” set of data quality dimensions and metrics for C2. We consider the
16 data quality criteria identified by the Total Data Quality Management (TDQM) research
community, as well as the dimensions identified by the ISO 8000 standard. We map these into the
criteria commonly applied by the intelligence community (IC), and those identified by various
parts of the U.S. Department of Defense (DoD). The IC’s “usability” criterion covers several
different concepts that are difficult to measure. Meanwhile, the DoD’s Net-Centric Data Strategy
(NCDS) arguably does not adequately address the notion of data timeliness. The NCDS covers
some important factors such as believability and reputation, but the coverage is primarily limited
to using authoritative, vetted data sources. This does not address important situations where data
comes from a variety of sources with varying degrees of reliability. On the other hand, the
TDQM criteria do not adequately capture the notions of readiness and adaptability. Once an
accepted set of data quality characteristics and associated metrics for C2 is available, there is a
good case for explicitly incorporating it into C2 system operations.


1  Introduction 

The transition to a net-centric environment and the increasing automation of command and
control (C2) functions make the quality of the underlying data upon which decisions and actions
are based critical to success. Operating on bad data can have serious consequences, especially in
a military context. In the commercial arena, it is estimated that operating on poor data has an
economic cost of about $600 billion annually [1]. A few of the many side effects of poor data
quality include delays due to reconciling data, loss of credibility, customer dissatisfaction,
compliance problems, lost revenue in the commercial world, and loss of trust in the automation and
computing systems. Properties that reflect good data (integrity, provenance, and timeliness, as
well as the ability to share the data with others and to have a common understanding of its
meaning) are intuitively desirable but are not routinely incorporated into today’s complex
systems, in part because the underlying architectures do not make data quality a primary objective
of system design. In the military C2 domain, the effects of poor data can have even more
disastrous consequences than in other domains. Making quality considerations an inherent part of
the design and maintenance processes of C2 systems should benefit decision making. We explore
some of the associated challenges and issues.

Data is a resource that must be managed, protected, and preserved across its life cycle like any
other. The dominant issues confronting data management in large enterprises have been
frequently reported and include missing or incorrect data, missing or incorrect metadata
deployment, redundant data storage, varying data semantics, and non-standard data formats. These
issues are also of prime concern in C2 systems. Data portability (freeing the data from
stovepiped applications) is also a common concern in both domains.

Various investigators have given different definitions of the terms “information” and “data,”
depending on the context. For this paper, we define these terms as follows:

•  Information  is  defined  as  knowledge  concerning  objects,  such  as  facts,  events,  things, 
processes,  or  ideas,  including  concepts,  that  within  a  certain  context  has  a  particular 
meaning  [2]. 

• Data is defined as the reinterpretable representation of information in a formalized
manner suitable for communication, interpretation or processing [2].

In the context of this paper, data includes both raw and processed information and “all data
assets such as system files, databases, documents, official electronic records, images, audio files,
Web sites, and data access services” [3].

For C2 functions, data is used to develop situational awareness and a common operating picture
(COP) by which commanders make decisions and effect control. Commanders require many
types of data, ranging from logistics to weather to geospatial to tactical information, to support
the various warfighter operations. Data must be collected, analyzed, and communicated via
various manual and automated messages and exchanged between various C2 systems and people. A
commander has little control over the sources that supply data to his C2 systems, especially in
times of crisis. Each C2 system may store portions of current data and maintain some amount of
past data for historical analysis purposes. The tempo of activity and the volume of data on which
a system depends are both rapidly increasing, revealing many stress points in the current
systems. In general terms, a modern C2 system is a large, heterogeneous, distributed, real-time
processing system that is resource limited (bandwidth and computation power) at some of the
end-points, with frequent disruptions and highly dynamic information flows. The data is
contained in multiple, distributed storage facilities and heterogeneous databases. As data is
delivered with higher frequency from more places, decision makers must become more responsive and
operate faster. Modern C2 systems, especially in a coalition environment, are among the most
complex systems imaginable.

In  this  paper,  we  examine  a  number  of  important  science  and  technology  (S&T)  issues  relating  to 
data  quality  in  C2  systems.  First,  we  discuss  the  characterization  of  the  various  quality  properties 
of  data.  We  then  examine  several  of  these  quality  characteristics  in  the  context  of  C2  systems. 
Finally,  we  offer  some  suggestions  for  further  S&T  areas  to  address  some  of  the  issues. 

2  Data  Quality 

Data quality can be simply defined as the fitness for use of the data [4]. A more practical
definition is the degree to which data “meets the requirements of its authors, users, and
administrators” [5]. The key point to be taken from these definitions is that the generic notion of
the quality of data, like many other ideals of quality, is dependent on context or intended use.
Nevertheless, given that data is such a pervasive part of any information technology (IT) system,
there are many ways of partitioning its quality properties. In some early data quality research,
data was primarily characterized by Accuracy, Completeness, Timeliness, and Standards (ACTS). This
basic list has been expanded over the years in many directions. In particular, since the early
1990s, the Total Data Quality Management (TDQM) research community [6] has expanded
ACTS to 16 data quality dimensions and successfully used them in assessments of an
organization’s data quality environment:

•  Accessibility  -  The  extent  to  which  data  is  available  or  easily  and  quickly  retrievable 

• Amount of Information - The extent to which the volume of data is appropriate for the
task at hand

• Believability - The extent to which data is regarded as true and credible

• Reputation - The extent to which information is highly regarded in terms of its source or
content

•  Completeness  -  The  extent  to  which  information  is  not  missing  and  is  of  sufficient 
breadth  and  depth  for  the  task  at  hand 

•  Conciseness  -  The  extent  to  which  data  is  compactly  represented 

•  Consistent  Representation  -  The  extent  to  which  the  data  is  presented  in  the  same  format 

•  Ease  of  Operations  -  The  extent  to  which  data  is  easy  to  operate  on  and  apply  to  different 
tasks 

•  Free-of-Error  -  The  extent  to  which  data  is  correct  and  reliable 

•  Interpretability  -  The  extent  to  which  data  is  in  appropriate  languages,  symbols,  and  units 
and  the  extent  to  which  the  definitions  are  clear 

•  Objectivity  -  The  extent  to  which  data  is  unbiased,  unprejudiced,  and  impartial 

•  Relevancy  -  The  extent  to  which  data  is  applicable  and  helpful  for  the  task  at  hand 

•  Security  -  The  extent  to  which  access  to  data  is  restricted  appropriately  to  maintain  its 
security 

•  Timeliness  -  The  extent  to  which  data  is  sufficiently  up-to-date  for  the  task  at  hand 

•  Understandability  -  The  extent  to  which  data  is  easily  comprehended 

•  Value  Added  -  The  extent  to  which  data  is  beneficial  and  provides  advantages  from  its 
use 

These  16  characteristics  can  be  grouped  into  the  following  four  categories: 

•  Intrinsic  -  Accuracy,  reputation,  believability,  objectivity 

•  Accessibility  (Operational)  -  Accessibility,  access  control 

• Contextual - Relevancy, timeliness, completeness, amount of information, value added

• Representational - Conciseness, consistent representations, ease of operations,
interpretability, understandability

The intrinsic properties relate to the accuracy and pedigree of the data and do not change
depending on environment or intended use. Accessibility, in this usage, refers to the system
properties such as how and where the data is stored and the means of protecting the data, such as
access control. Contextual properties depend on the application for which the data is used and
can have temporal behavior. The representational properties are the more common notions of
standardization and interoperability. These categories also help to show how the characteristics
are related to each other and the environment in which they are situated.
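The grouping above can be written down as a small lookup table, which makes it easy to check that the four categories partition the 16 dimensions. The following Python sketch is illustrative only: the names `TDQM_CATEGORIES` and `category_of` are hypothetical, and it assumes that "Accuracy" in the category list corresponds to the Free-of-Error dimension and that "access control" corresponds to the Security dimension.

```python
# Hypothetical lookup table for the four TDQM categories described above.
# Assumption: "Accuracy" in the category list is the Free-of-Error dimension,
# and "access control" is the Security dimension.
TDQM_CATEGORIES = {
    "Intrinsic": ["Free-of-Error", "Reputation", "Believability", "Objectivity"],
    "Accessibility": ["Accessibility", "Security"],
    "Contextual": ["Relevancy", "Timeliness", "Completeness",
                   "Amount of Information", "Value Added"],
    "Representational": ["Conciseness", "Consistent Representation",
                         "Ease of Operations", "Interpretability",
                         "Understandability"],
}

# Flat list of every dimension, in category order.
ALL_DIMENSIONS = [d for dims in TDQM_CATEGORIES.values() for d in dims]


def category_of(dimension: str) -> str:
    """Return the category that contains the given dimension name."""
    for category, dims in TDQM_CATEGORIES.items():
        if dimension in dims:
            return category
    raise KeyError(f"unknown TDQM dimension: {dimension}")
```

For example, `category_of("Timeliness")` returns `"Contextual"`, reflecting that timeliness depends on the application for which the data is used.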

An International Organization for Standardization (ISO) standard on Data Quality, ISO 8000 [7],
is currently being developed. It is primarily aimed at quality facets of automated information
exchange for the purchase of goods. ISO 8000 defines formats for descriptions of individuals,
organizations, locations, and goods or services. It defines data quality using five characteristics
(Syntax, Provenance, Completeness, Accuracy, and Certification) and considers the processes that
are needed to assure data quality. Reference [8] defines master data as data held by an
organization that describes the independent and fundamental entities for an enterprise. For an
organization, this might include descriptions of customers, suppliers, products, locations, and so
forth. ISO 8000 Part 110 focuses on requirements for exchange of master data that can be checked
through automation [9]. Parts covering the representation and exchange of information about
provenance (Part 120), accuracy (Part 130), and completeness (Part 140) have also been recently
published. Provenance information, for example, may include the record of origination,
transcription, abstraction, validation, ownership, and transfer of ownership of data.

In general, ISO 8000 is oriented toward logistics information, manufacturing applications, or
Enterprise Resource Planning (ERP) systems. It has been supported by organizations such as the
North Atlantic Treaty Organization (NATO) and the Defense Logistics Information Service
(DLIS). DLIS has supported the transition of the Federal Catalog System (FCS) and NATO
Codification System (NCS) into these open public standards. The Federal Logistics Information
System (FLIS) provides automated data on the FCS and descriptions of items of supply for the
U.S. military. It serves as the common frame of reference for Department of Defense (DoD)
buyers to communicate with their industrial supplier base [10].

ISO 8000 is closely aligned with other data exchange standards, such as the ISO 22745 Open
Technical Dictionary (OTD), which defines concepts for describing items, and a query interface
for accessing the definitions [11]. The Electronic Commerce Code Management Association
(ECCMA) Open Technical Dictionary (eOTD) is an ISO-22745-compliant dictionary that has
evolved from the NCS and is directed toward the global commercial environment [12]. An
eOTD catalog is composed of Extensible Markup Language (XML) files that contain
information explicitly encoded using eOTD concept identifiers. This is based on the NCS, which
describes a common supply language for NATO’s logistical needs. Currently, over 31 million
reference numbers, 22 million users, and 1.5 million organizations are registered in the system.

Another ISO data quality standard, ISO/IEC [International Electrotechnical Commission]
25012 (“Data Quality Model”), is under development in the domain of software engineering and
software quality [13]. This data quality standard is part of a family of standards (25012, 25020,
25021, 25030), called the SQuaRE standards, that define systems and software engineering quality
requirements and measurements from the software perspective. The ISO/IEC
25012 document is aimed at structured data stored in computer systems and defines 15 data
quality characteristics from two points of view: inherent and external. Inherent data quality is
similar to the intrinsic category discussed previously, and external data quality refers to
system-dependent aspects that preserve data quality. The 15 characteristics are:

• Accuracy - The extent to which data has attributes that correctly represent the true value
of the intended attribute of a concept or event in a specific context of use

• Completeness - The extent to which subjects associated with an entity have values for all
expected attributes and related entity instances in a specific context of use

• Consistency - The extent to which data has attributes that are free from contradiction and
coherent with other data in a specific context of use

• Credibility - The extent to which data has attributes that are regarded as true and
believable by users in a specific context of use

• Currentness - The extent to which data has attributes that are of the right age in a specific
context

• Accessibility - The extent to which data has attributes that enable it to be reached in a
specific context of use, particularly by people who need supporting technology or special
configuration because of some disability

• Compliance - The extent to which data has attributes that adhere to standards,
conventions, or regulations in force and similar rules relating to data quality in a specific context
of use

• Confidentiality - The extent to which data has attributes that ensure that it is accessed
and interpreted only by authorized users in a specific context of use

• Performance - The extent to which data has attributes that can be processed and provide
the expected level of performance by using the appropriate amounts and types of
resources under stated conditions and in a specific context of use

• Precision - The extent to which data has attributes that are exact or that provide
discrimination in a specific context of use

• Traceability - The extent to which data has attributes that provide an audit trail of
accesses to the data and of any changes made to the data in a specific context of use

• Understandability - The extent to which data (and associated metadata) has attributes that
enable it to be read and easily interpreted by users and are expressed in appropriate
languages, symbols, and units in a specific context of use

• Availability - The extent to which data has attributes that enable it to be retrieved in a
specific context of use

• Portability - The extent to which data has attributes that enable it to be moved from one
platform to another, preserving the existing quality in a specific context of use
• Recoverability - The extent to which data has attributes that enable it to maintain and
preserve a specified level of operations and quality, even in the event of failure, in a
specific context of use

There is clear overlap with the TDQM characteristics but also some key differences, primarily
from the operational viewpoint. Some of the operational characteristics that are not stressed in
TDQM include Performance, Portability, Recoverability, and Availability. In ISO/IEC 25012,
Compliance refers to adherence to standards and regulations, something TDQM does not
explicitly consider. ISO/IEC 25012 also groups the characteristics according to whether they refer
to inherent or external data quality characteristics, or both. In the order listed above, Accuracy
through Understandability are inherent, and Accessibility through Recoverability are external.
Accessibility through Understandability therefore have both features.

The  intelligence  community  (IC)  has  traditionally  been  very  concerned  about  data  quality.  The 
Joint  Military  Intelligence  Committee  identified  six  characteristics  of  data  quality  [14]: 

• Accuracy: Data and its sources are evaluated for technical errors, misperceptions, and
deliberate efforts to mislead.

• Objectivity: The data is examined for deliberate distortions and manipulations due to
self-interest.

• Usability: Data is compatible with a customer’s capabilities for receiving, manipulating,
protecting, and storing the product and is ready when needed.

• Relevance: Information is applicable to customer requirements.

• Readiness: Data systems must be responsive to the dynamic requirements of customers.

• Timeliness: Data must be available and acted upon when it is required.

These  properties  have  been  extended  to  the  16  TDQM  properties  as  described  in  Reference  [15]. 
The  six  basic  categories  above  are  naturally  slanted  toward  the  needs  of  the  IC.  Since  C2  systems 
rely  on  intelligence  products,  their  needs  are  similar. 

Data sharing and accessibility are areas that have received much public attention since 9/11. The
IC is also very worried about spoofing or the injection of false data that can corrupt decisions or
analyses. There is a great need to track sources and the intermediate handling of data to detect
deliberate deception attempts. Another concern is that of inconsistent data that can arise from
multiple observers. Non-authoritative sources of data are also a persistent problem, and proper
weighting is needed. In some C2 systems, such as the Global Command and Control System
(GCCS), the data is generally vetted and considered authoritative, while in others, such as the
Tactical Ground Reporting (TIGR) System, the data can be entered by any user who observes an
interesting event. Both types of systems have their uses, but the differences show that the
pedigree of data should be an explicit factor. Another interesting IC and C2 issue is that
information that was presented as true may later be found to be untrue and that this
metainformation needs to be disseminated as well. However, some data quality properties, such as
timeliness and accuracy, can have a more severe impact in a C2 tactical situation. It is not
acceptable, for example, to target the wrong building.

The DoD has recognized data quality as an important issue in the last decade and has published
the following key documents [16]:

• DoD Net-Centric Data Strategy (NCDS), May 2003

• Data Sharing in a Net-Centric Department of Defense, Dec. 2004

• Guidance for Implementing Net-Centric Data Sharing, Apr. 2006

•  DoD  Command  and  Control  (C2)  Strategic  Plan  Version  1.0,  Dec.  2008 

•  Interim  Guidance  to  Implement  NCDS  in  the  C2  Portfolio,  Mar.  2009 

•  DoD  C2  Implementation  Plan  Version  1.0,  Oct.  2009 

The DoD NCDS [17] and the Army Data Transformation (ADT) [18] effort are two examples of
strategy developed in this area. Both documents are designed for a larger community than C2,
which is considered one Community of Interest (COI). However, both directly affect the
direction of current and planned C2 systems.

The  NCDS  defines  seven  goals  in  its  data  strategy: 

1.  Visible  (who  has  data  and  what  kind  it  is)  -  Data  can  be  discovered  through  search  of 
catalogs,  registries,  and  so  forth.  Visibility  is  accomplished  through  use  of  metadata 
descriptions. 

2.  Accessible  (where  and  what  format)  -  Data  is  posted  to  storage  areas  where  it  can  be 
obtained  by  others.  The  data  is  accompanied  by  metadata  descriptions.  The  data  is  made 
available  to  others  based  on  access  control  policy. 

3.  Understandable  (what  its  meaning  is)  -  Data  syntax  and  its  semantic  meaning  can  be 
uniquely  interpreted. 

4.  Institutionalized  (what  and  who  governs  it)  -  Data  is  incorporated  into  standard  processes 
and  practices. 

5.  Trusted  (trustworthy,  accurate,  and  authoritative)  -  The  validity  of  the  data  can  be 
assessed  based  on  its  provenance,  security  protection,  access  control,  and  integrity. 

6.  Interoperable  -  Data  can  be  shared  among  different  predefined  or  unanticipated  users  or 
systems.  Common  data  models  and  metadata  are  used  to  support  this  interoperability. 

7. Responsive to users’ needs (applicable and timely) - Methods to accommodate user
perspective via feedback are incorporated into the data practices.

The NCDS states that the aforementioned goals do not include data quality or accuracy
considerations but that achieving the goals should result in improved data quality and accuracy.

The ADT plan is aimed at processes to improve data quality as the systems are transformed to
net-centric operations. The handling of data is tightly coupled with the Army Enterprise
Information Architecture (EIA) that is part of the overall Army Enterprise Architecture (AEA),
so the ADT plan, which treats data separately from architecture, does not itself describe the
implementation of data services. The AEA is a service-oriented architecture that deals with many
data-oriented services such as displays (user-defined dashboards), common exchange schemas such as
the Universal Core (UCore), and interfaces to systems such as GCSS. A good description of the
relationship between the Army Net-Centric Data Strategy (ANCDS) and the Army Service
Oriented Architecture (SOA) is given in Reference [19]. The ADT has indicated six phases
in which it is working to improve data management and data quality:

1. Accountable - Incorporate common data standards and governance practices.

2. Authoritative - Identify and manage master data elements and authoritative sources.

3. Transform - Employ standardized structures and schemas such as data yellow pages to
improve data sharing.

4. Expose - Make data accessible and responsive to users through the Army Data Services
Layer (ADSL). Four methods of exposing data are Messaging, Data Services, Data
Warehouses, and Data Security.

5.  Register  -  Validate  data  schemas  and  services  against  standards  and  then  register  in 
repositories  (e.g.,  authoritative  data  repository)  to  enable  visibility  and  reuse. 

6.  Assess  -  Monitor  and  assess  data  maturity  levels  using  metrics.  Measure  the  progress  in 
improving  data  quality. 

A key portion of the strategy is the ADSL, which is part of the EIA and provides application
services for standardized handling of data, such as [20]:

• Data Mediation - Transform data among different types, vocabularies, and semantics to
support interoperability. Services include Structural Transform Service, Semantic
Mediation Services, Data Validation Services, and Data Brokering Services.

• Data Discovery and Data Access - Provide common service-based access to repositories
for search and retrieval of data to support visibility and accessibility. Services include
Data Search, Federated Search, Data Retrieval, Data Events, and Data Streaming.

• Data Abstraction - Make data understandable through use of metadata, establish a
common taxonomy, and manage authoritative sources. Services include Metadata Discovery,
Metadata Publishing, and Data Abstraction.

• Data Management - Provide the persistence and stewardship of data resources to
establish trusted data. Services include Data Replication, Data Archival, Data Auditing, and
Reference Data Management.

•  Data  Governance  -  Capture  and  govern  data  resources.  Services  include  Namespace, 
Schema,  and  Ontology  Management. 

The  ADSL  also  hides  the  details  of  the  lower  layers  of  data  handling,  such  as  databases  and 
repositories,  from  the  applications  and  users  to  enable  improved  data  portability.  The  connection 
with  the  data  quality  characteristics  of  the  NCDS  is  clear. 

In Table 1, we present an initial mapping from the data quality concepts of ISO 8000,
ISO/IEC 25012, the NCDS goals (and the ADT phases), and the IC to the TDQM 16 categories.

Table 1: Comparison of Data Quality Characteristics

TDQM | DoD NCDS Data Goals | IC | ISO 8000 | ISO 25012

Intrinsic:
Free of error | - | Accuracy | Accuracy | Accuracy, Precision
Reputation | Accountable (Authoritative) | - | Certification | -
Believability | Accountable (Authoritative) | - | Certification | Credibility
Objectivity (Provenance) | Accountable (Authoritative), Trusted | Objectivity | Provenance | Traceability

Operational (Accessibility):
Accessibility | Visible, Accessible (Expose) | Usability | - | Accessibility, Availability, Portability, Recoverability, Performance
Security (Access Control) | Trusted (Expose) | - | - | Confidentiality

Contextual:
Amount of Information | - | - | - | -
Relevance | Responsive to Users’ Needs | Relevance, Readiness | - | -
Value added | - | - | - | -
Timeliness | - | Timeliness | - | Currentness
Completeness | - | - | Completeness | Completeness

Representational:
Understandability | Understandable | Usability | Master Data: Semantic encoding, OTD | Understandability
Conciseness | - | - | - | -
Ease of operation | - | - | - | Performance
Interpretability | Interoperable | Usability | Master Data Syntax | -
Consistent Representations | Institutionalized, Interoperable (Standards) | - | Master Data: Conformance | Consistency, Compliance

In the NCDS column of Table 1, we have indicated in parentheses the phases of the ADT that
may be expected to have the most impact on data quality. For the IC, it appears that usability
covers several areas and would be difficult to measure. Also, interestingly, the TDQM list does
not seem to capture the notion of readiness, which indicates that the data is adaptable to changing
circumstances and requirements. The ISO 8000 and related standards provide a broad range of
coverage; however, they do not address some important issues, such as timeliness or ease of
operation. The NCDS also fails to address certain properties, particularly timeliness, which is
critical to C2. Also, although the table indicates that NCDS covers some areas such as
believability and reputation, the extent of this coverage, which is primarily limited to using
authoritative data sources that have been vetted, does not span all the situations frequently
encountered in C2, such as data from a variety of sources with varying pedigree (provenance,
reliability, and so forth).

Other studies for various application contexts have identified many additional characteristics,
such as a study of data quality for web portals, which identified 42 different quality features [21].
However, as we mentioned earlier, we are primarily seeking to use these characteristics as an
organizational tool to consider the major issues in C2 systems, as opposed to compiling a
complete listing.

3  Metrics  and  Tools 

It  is  sometimes  useful  to  employ  metrics  to  quantify  the  quality  of  the  data  under  consideration 
and  to  make  economic  or  strategic  decisions  on  how  to  improve  or  maintain  a  given  quality 
level.  Many  researchers  have  proposed  a  variety  of  metrics  and  generally  have  divided  them  into 
objective  and  subjective  measures,  but  their  interpretation  is  typically  context  dependent.  For 
instance,  in  some  applications,  such  as  digital  voice,  it  is  acceptable  to  have  a  percentage  of 
missing  data  without  appreciably  degrading  the  quality.  In  other  applications,  a  missing  value 
could  be  catastrophic. 

In Reference [6], metrics for the 16 TDQM features are defined in three basic forms: (1) simple
ratio,  (2)  min  or  max  and  (3)  weighted  average.  The  metrics  are  typically  normalized  between  0 
and  1. 

Using  a  simple  ratio,  it  is  possible  to  represent  completeness,  accuracy,  precision,  consistency, 
concise  representation,  relevancy,  and  ease  of  manipulation.  For  example,  an  accuracy  metric 
can  be  a  simple  ratio  of  the  number  of  accurate  records  divided  by  the  total  number  of  records. 
The  criteria  for  accuracy  are  a  function  of  the  context  or  application.  These  are  high-level  notions 
and may be made more specific to satisfy the circumstances, such as schema, column, and population
completeness in a database.
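
The simple-ratio form described above can be sketched in a few lines of Python. This is a minimal illustration, not the formulation of Reference [6]: the function names, the record layout (a list of dictionaries), and the rule that a missing or `None` value counts as incomplete are all assumptions made for the example.

```python
from typing import Any, Dict, List


def simple_ratio(total: int, undesirable: int) -> float:
    """Return 1 - undesirable/total, a score in [0, 1] where 1 is best."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= undesirable <= total:
        raise ValueError("undesirable must lie between 0 and total")
    return 1.0 - undesirable / total


def column_completeness(records: List[Dict[str, Any]], field: str) -> float:
    """Column completeness: fraction of records with a value for `field`.

    Assumption for this sketch: an absent key or a None value is "missing".
    """
    missing = sum(1 for record in records if record.get(field) is None)
    return simple_ratio(len(records), missing)
```

An accuracy metric has the same shape: count the records that fail an application-specific accuracy check and pass that count as `undesirable`.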

Min  or  max  operations  can  be  used  for  metrics  that  are  composed  of  several  underlying  features. 
Examples include believability, timeliness, accessibility, or amount of data. For example, timeliness
has been defined [22] as max[0, 1 - (age at delivery / shelf-life)], where age at delivery is
the delivery time minus data creation time and shelf-life (volatility) is the total length of time that
data is valid and usable.

If  the  age  is  less  than  the  shelf  life,  the  data  is  still  usable.  The  earlier  the  data  is  delivered,  the 
more time there is to process the data and, thus, the larger the metric. In other studies, other
functional forms are used to represent the decay of timeliness, and the function is often
raised to an exponent to magnify the effects of the timeliness.
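
As a worked illustration of the timeliness metric, the following Python sketch implements max[0, 1 - (age at delivery / shelf-life)] with an optional exponent. The parameter names and the default exponent of 1 are assumptions for this example; the appropriate exponent depends on the application.

```python
def timeliness(age_at_delivery: float, shelf_life: float, exponent: float = 1.0) -> float:
    """Timeliness score in [0, 1].

    age_at_delivery: delivery time minus data creation time.
    shelf_life: total length of time the data is valid and usable.
    exponent: illustrative sensitivity parameter; values above 1 make the
        score fall off faster as the data ages.
    Both times must be expressed in the same unit.
    """
    if shelf_life <= 0:
        raise ValueError("shelf_life must be positive")
    if age_at_delivery < 0:
        raise ValueError("age_at_delivery must not be negative")
    return max(0.0, 1.0 - age_at_delivery / shelf_life) ** exponent
```

For data that is 30 minutes old at delivery and has a shelf life of 120 minutes, the score is 0.75 with an exponent of 1 and 0.5625 with an exponent of 2. Data delivered after its shelf life has expired scores 0.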

The weighted average metrics are used if there is enough detailed information on the underlying
features to determine their relative contributions. In addition, weighting the simple measures can
allow  incorporating  notions  of  criticality,  utility,  and/or  costs. 
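
The min operation and the weighted average can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and the choice of component scores and weights is an assumption that would have to come from the application context.

```python
from typing import Dict, Iterable


def min_aggregate(scores: Iterable[float]) -> float:
    """Conservative aggregate: the overall score is the weakest component."""
    return min(scores)


def weighted_average(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of normalized scores.

    Weights are normalized by their sum, so they can express relative
    criticality, utility, or cost on any convenient scale.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```

For example, with scores of 0.5 for timeliness and 1.0 for accuracy, weighting timeliness three times as heavily as accuracy gives (0.5 * 3 + 1.0 * 1) / 4 = 0.625, whereas the min operation would report 0.5.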

Some  metrics  are  naturally  objective,  and  others  are  subjective.  “Believability,”  for  example,  is 
subjective  and  must  be  assessed  from  user  opinion  or  surveys  rather  than  direct  measurements  or 
observations.  In  Reference  [6],  metrics  were  developed  for  each  of  the  TDQM  dimensions  based 
on  subjective  and  objective  surveys  of  both  users  and  system.  The  exact  forms  of  the  metrics  or 
the  weighting  of  the  metrics  depend  on  the  various  contextual  situations.  For  example,  timeliness 
may  be  more  critical  in  some  applications  than  in  others.  An  interesting  observation  made  in 
Reference  [6]  was  that  the  subjective  results  often  differed,  depending  on  the  perspective  of 
those  interviewed.  For  example,  the  believability  of  the  data  was  often  different  between  the 
users  and  the  data  system  owners.  Discrepancies  such  as  this  indicate  further  analysis  may  be 
necessary. 

Many  tools  are  available  in  the  commercial  and  open-source  domains  to  support  data  quality 
measurement  and  improvement.  Data  validation  tools  examine  data  as  it  is  input  into  the  system 
and  reject  or  correct  data  item  errors.  Extract-Transform-Load  (ETL)  tools  can  sometimes  be 
configured to perform validation functions as the external data is prepared and entered into an
existing  data  set.  Data  profiling  or  data  auditing  tools  examine  a  data  set  to  identify  problems, 
such  as  missing,  duplicate,  inconsistent,  and  otherwise  anomalous  data,  and  also  compute  data 
quality  metrics.  Data  cleansing  (or  scrubbing)  tools  go  through  an  existing  data  set  and  attempt  to 
detect,  correct,  or  remove  troublesome  data  items  (incorrect,  incomplete,  inaccurate,  and  so 
forth).  Many  variations  are  available  in  the  market,  with  some  tools  using  complex  reasoning  and 
rules  on  relations  to  correct  data  sets.  Data  cleansing  can  be  quite  time  consuming  on  large  data 
sets, and efficiency is a key consideration. Other tools are used to monitor the data set to maintain
the data quality as the data set is used.
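
To make the profiling step concrete, the sketch below counts missing values and duplicate records in a small in-memory data set. It is a toy illustration of what a data profiling tool reports, not a model of any particular product; the function name, the record layout, and the rule that `None` or an empty string counts as missing are assumptions.

```python
from collections import Counter
from typing import Any, Dict, List


def profile(records: List[Dict[str, Any]],
            required_fields: List[str],
            key_fields: List[str]) -> Dict[str, Any]:
    """Report missing required values and duplicate records.

    Two records are treated as duplicates when they agree on every field
    listed in key_fields (an assumption for this sketch).
    """
    missing: Counter = Counter()
    for record in records:
        for name in required_fields:
            if record.get(name) in (None, ""):
                missing[name] += 1

    keys = Counter(tuple(record.get(name) for name in key_fields) for record in records)
    duplicates = sum(count - 1 for count in keys.values() if count > 1)

    return {
        "records": len(records),
        "missing": dict(missing),
        "duplicate_records": duplicates,
    }
```

The counts returned here can feed directly into simple-ratio metrics such as completeness.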

It  is  well  known  that  one-time  attempts  to  improve  data  quality  are  not  sufficient  because  data 
degrades over time due to factors such as data change, system change, and migration. For example,
data on people can change rapidly due to change of residence, death, marriage, divorce, and
so  forth.  It  is  generally  accepted  that  a  continual  process  to  monitor  data  quality  is  necessary. 
Also  necessary  is  the  establishment  of  clearly  defined  policies  and  governance.  Several  methods 
have  been  proposed  to  help  organizations  manage  data  quality  continuously  to  achieve  desired 
levels.  One  important  method,  based  on  a  diagrammatic  scheme  called  Information  Production 
Maps  (IPMs),  models  data  as  a  product  that  goes  through  manufacturing  stages  similar  to  an 
actual physical product and applies similar quality management procedures [23]. IPMs are particularly
useful for dynamic decision environments, such as e-business or C2 systems, where
timely quality information can have a large impact on effective decision making.

4 C2 Systems


Each  of  the  U.S.  Armed  Services  maintains  its  own  family  of  C2  systems  that  are  tailored  to  its 
particular  mission  needs:  air,  ground,  sea,  space,  special  operations.  In  joint  and  coalition  opera¬ 
tions,  each  participating  Service  or  nation  comes  with  its  own  C2  systems.  U.S.  joint  commands 
employ C2 systems that must combine information from the multiple Services. Coalition commands
must exchange information among the Services and with other countries. These information-sharing
requirements cause significant problems in how to control access to data properly
and  often  how  to  control  data  crossing  security  classification  domains  (cross-domain  security). 

The functions of a C2 system are many and varied. To better understand where C2 fits in the
warfighting domain, it is instructive to look at the U.S. Joint Staff's Joint Capability Areas
(JCAs), a collection of the primary functions involved with warfighting [24]. C2 is one of the
top-level capabilities. The nine JCAs are

•  Force  Application 

•  Logistics 

•  Protection 

•  Force  Support 

•  Corporate  Management  and  Support 

•  Command  and  Control 

•  Battlespace  Awareness 

•  Net-Centric 

•  Building  Partnerships 

Within  Command  and  Control,  the  following  capabilities  are  defined:  Organize,  Understand, 
Planning,  Decide,  Direct,  and  Monitor.  As  can  be  inferred  from  these  functions,  C2  capabilities 
are  heavily  dependent  on  the  quality  of  the  information  that  is  immediately  available  or  that  can 
be obtained from other sources and also on the ability to communicate that information to and
from  the  other  capabilities.  The  communication  functions  are  heavily  used  by  the  C2  functions 
but  are  primarily  included  under  the  Net-Centric  JCA  and  will  be  briefly  considered  further  in 
this  paper.  Specific  requirements  for  information  can  be  issued  from  C2  to  the  other  JCAs  such 
as  Battlespace  Awareness  or  Logistics,  for  which  many  of  the  data  quality  issues  equally  apply. 

From  a  C2  perspective,  the  key  data  issues  that  are  frequently  discussed  include  interoperability, 
distributed  access,  timeliness,  accuracy,  provenance,  and  security.  There  are  also  issues  with 
information  overload,  since  the  volume  of  available  data,  both  from  the  tactical  and  strategic 
sides,  is  rapidly  increasing.  The  data  needs  to  be  processed  in  a  timely  manner,  incorporated  into 
the COP, and delivered where needed. There are also issues associated with limited communications
capacity. This limits data availability, and C2 systems must accommodate these resource-constrained
situations. Looking at this from the data quality perspective, we see that most of
these issues are captured by the 16 data quality properties discussed previously. A C2 data strategy
should explicitly address all 16 categories.

Several incidents with dramatic consequences were at least partially due to data quality problems.
The unintentional 1999 bombing of the Chinese Embassy in Belgrade by U.S. planes, while admittedly
caused by a systemic failure in the targeting process, was plagued by data issues [25]. One example
was the use of older map data that failed to show the updated location of the embassy after a
move in 1996. Also, the actual address of the intended target (a warehouse) was only estimated
and not carefully verified against a map with accurate address information. Other problems were
caused by duplicate target requests that appeared to come from different sources but were ultimately
from the same source (this is sometimes called “ringing” and is due to a lack of provenance).
Further, there was a failure to check the target against a database of known off-limits
targets.

Data quality issues have also been identified in two other disasters: the space shuttle Challenger
explosion on January 28, 1986, and the shooting down of an Iranian Airbus by the USS
Vincennes on July 3, 1988 [26]. The Presidential Commission investigating the Challenger disaster
cited flawed decision making surrounding the possible problem with O-rings at cold temperatures.
The attack on the Iranian Airbus was also attributed to flawed decision making under
time pressure, when the ship identified the Airbus as a hostile military jet in attack mode. From
the data quality perspective, the decisions were affected by lapses in accuracy, completeness,
consistency, relevance, and fitness-for-use in the Challenger case and by lapses in accuracy,
completeness, consistency, fitness-for-use, and timeliness for the USS Vincennes. For the space shuttle
Challenger, the data needed for proper analysis was available but not properly used. It was not
presented in a form that assisted the management in making correct decisions. For the Vincennes,
the initial misclassification occurred when users did not realize that the system reused a target
designation number and then failed to resolve the resulting inconsistencies. Given all the pressures
of decision making, it is arguable that data issues contributed to the erroneous decisions.

Another example of C2 data quality issues is Operation Anaconda [27, 28], in
which the U.S. Army successfully defeated Al Qaeda forces in the Shahi-Kot Valley of eastern
Afghanistan in March 2002. Though the operation ultimately succeeded, the initial battle plan
required extensive modification. It was designed to last for a week; however, the battle lasted
17 days, and resistance was much more difficult than anticipated, requiring much more air support.
Some of the problems were related to the quality of the intelligence data, such as inaccurate
and incomplete estimates of enemy forces and their willingness to fight or the disposition of
civilians. There were also interoperability problems among and between joint and coalition
forces. The intelligence data, which relied primarily on human intelligence, proved to be faulty
and was not properly verified and vetted, reflecting believability and accuracy issues. The satellite
imagery was often 3 days old. Decision makers did not consider use of assets such as a
Global Hawk unmanned air vehicle (UAV), which could linger over the area and provide more
timely information. Some of the interoperability issues arose from a lack of unity of command
due to the relative newness of the Army forces in the area and a lack of command authority over
Special Forces, air support, and Afghan allied forces, which were all part of the operation. For
example, there was an incident where Army troops mistakenly fired on the in-place Special
Operations Force (SOF) team, which did not have compatible radios. Also, U.S. gunships mistakenly
fired on an allied Afghan column, partially causing it to turn away from the area.
Although communications reportedly worked for each U.S. Service component, problems
occurred in communicating with other Services and with allied Afghan forces. In addition, long-range
communications between headquarters and edge forces were bandwidth limited, and communication
between headquarters and central command was inconsistent (timeliness, accessibility).
“Army personnel could use their FM radios to communicate directly with overhead Navy
and Marine Corps aircraft but not USAF aircraft, such as F-15Es and bombers” [29]. Also, a lack
of common understanding about the differing rules of engagement and procedures governing
Close Air Support (CAS) contributed to the difficulties, reflecting understandability problems [29].

4.1  Interoperability 

The  ability  to  share  and  exchange  data  between  various  C2  systems  constitutes  a  serious  problem 
in  the  C2  environment.  Currently,  each  of  the  Services  has  its  own  C2  systems,  which  consist  of 
a family of related systems. For example, GCCS, a family of C2 systems, includes over 200 systems
or services and is intended to have worldwide reach and incorporate components from all
branches  [30].  Data  must  be  exchanged  among  the  systems  in  the  same  family  and  with  other 
non-family  systems.  This  problem  has  been  well  known  for  many  years  in  joint  and  coalition 
settings  [31],  and  several  key  developments  have  been  achieved,  such  as  the  NATO  Network 
Enabled  Capabilities  (NNEC)  COP.  The  NNEC  COP  addresses  issues  such  as  standards, 
dynamic  tailoring,  multi-level  security,  provenance,  and  knowledge  management  (timeliness  and 
access).  Within  the  U.S.  government,  the  UCore  is  being  promoted  as  a  standard  for  information 
exchange  between  systems.  DoD  has  agreed  that  all  of  the  Services  shall  use  UCore  (currently 
version  2.0)  as  the  basis  for  semantic  representation  of  data  exchanges.  In  particular,  C2  data  will 
be aligned with UCore. UCore is an information exchange specification and implementation profile
that defines a vocabulary of commonly exchanged concepts such as who, what, when, and
where. There is a syntactic representation based on XML, guidance for extensions for representing
domain (or COI) areas, security markings, and a messaging framework. A very general
taxonomy  is  defined  to  represent  basic  concepts,  but  UCore’s  generality  needs  to  be  tailored  for 
each  domain.  Semantic  layer  issues  for  UCore,  such  as  those  defined  in  the  UCore  SL  [Semantic 
Layer]  are  still  being  investigated  by  researchers  (e.g.,  National  Center  for  Ontological  Research 
(NCOR)).  Some  issues  for  further  development  in  UCore  are  temporal  relationships  and 
allowing  items  to  be  of  different  types  at  different  times  (e.g.,  weapon,  cargo,  and  so  forth). 

Other representations of the C2 domain have been in use for quite some time. One in particular is the
Joint Consultation, Command and Control Information Exchange Data Model (JC3IEDM),
which is used by many countries and also by NATO. JC3IEDM exchanges are not XML-based
internationally, but JC3IEDM is the Army’s chosen data model for information exchange as per
Reference [32]. UCore, the XML-based framework, is appropriate for information exchange
between the Army and other military organizations, other government agencies, non-government
organizations (NGOs), and the various multi-national communities (should they adopt the UCore
messaging specification).

A  DoD  high-level  data  model  for  C2,  called  C2Core,  is  being  proposed  as  an  extension  to  the  C2 
domain  for  UCore.  C2Core  has  six  elements  of  C2  systems: 

1.  Force  Structure,  Integration,  Organization

2.  Situational  Awareness 

3.  Planning  and  Analysis 

4.  Decision  Making  and  Direction 

5.  Operational  Functions  and  Tasks 

6.  Monitoring  Progress  (Assessing) 

C2  Core  Ontology  is  based  upon  these  elements,  and  the  vocabulary  is  based  on  Joint  Doctrine. 
One  observation  is  that  the  breakdown  of  C2  elements  differs  slightly  from  the  JCA  capabilities 
mentioned  earlier.  There  is  still  work  required  to  harmonize  the  various  efforts  to  standardize 
concepts,  data  models,  and  ontologies  for  C2. 

In  a  recent  study  of  data-related  issues  [33],  it  was  noted  that  the  C2  community  could  benefit 
from  use  of  UCore  and  C2  Core  coupled  with  additional  C2-Specific  Extensions  from  UCore  to 
facilitate  data  sharing  within  the  C2  community  and  definition  of  core  C2-specific  services.  The 
Joint  C2  Conceptual  Model  and  Joint  C2  Vocabulary,  the  inclusion  of  real-world  operational 
needs,  the  JC3IEDM  artifacts,  artifacts  from  ongoing  data  exchange  development,  and  legacy 
message formats all need to be accommodated in UCore. Several other key issues were identified,
such as the lack of a runtime component and a highly complex underlying model that is not easily
implemented  in  a  modular  fashion. 

Worthwhile  future  developments  might  include  better  methods  to  enable  operators  to  discover, 
use,  and  manipulate  data  in  ways  that  cannot  be  imagined  a  priori  and  to  do  so  dynamically  while 
deployed.  These  new  developments  are  desperately  needed  by  edge  users.  There  is  also  a  great 
need  for  data  mediation  services  to  enable  interoperability  and  to  fast-track  warrior  requests  for 
data  sharing. 

Reference  [34]  presents  an  analysis  of  how  to  move  forward  with  integrating  the  various  data 
models.  The  authors  conclude  that  UCore  requires  extensions  to  include  the  full  JC3IEDM  and 
that  there  will  still  need  to  be  a  mapping  of  JC3IEDM  to  C2Core.  JC3IEDM  is  much  more 
detailed  than  what  is  currently  proposed  in  the  C2Core  and  will  require  the  stakeholder  user 
groups  to  agree  on  a  consolidated  representation  that  conforms  to  the  UCore  directive.  Some  of 
the  implementation  and  runtime  issues  in  data  sharing  are  addressed  in  Reference  [16],  which 
has a description of the C2 Information Sharing Framework. Many of the specific services
designed to improve quality (e.g., adoption of UCore, C2Core, metadata, data monitoring,
data access control, and, optionally, reputation services) are described.

A  key  scientific  issue  relating  to  interoperability  involves  exploring  automated  methods  to 
resolve  differences  among  the  semantics  of  the  differing  systems.  Even  with  standardized  data 
exchange  methods,  there  will  be  subtle  interpretations  of  data  that  will  need  to  be  resolved.  There 
are  too  many  relationships  among  data  for  people  to  represent  and  capture  all  of  the  relations 
between  the  involved  entities,  and  the  overall  process  would  benefit  if  it  could  be  automated. 

4.2  Volume  of  Data 

It  is  well  known  that  the  amount  of  raw  and  processed  data  entering  C2  systems  is  growing 
rapidly.  With  the  expected  additions  of  more  and  more  sensors,  each  with  potentially  greater 
ability  to  produce  data,  the  amount  of  raw  data  will  explode.  Even  now,  in  some  surveillance 
applications,  data  is  being  generated  at  a  faster  rate  than  it  can  be  processed,  and  it  ends  up  being 
archived for later examination. With the increasing use of unmanned platforms, such as UAVs,
the  demand  for  information  delivered  in  real  time  to  the  edge  is  also  growing.  Consider  the  Air 
Force  Reaper-mounted  “Gorgon  Stare,”  which  can  transmit  up  to  65  video  images  per  second 
[35] or future systems such as the Defense Advanced Research Projects Agency (DARPA) Autonomous
Real-time Ground Ubiquitous Surveillance-Imaging System (ARGUS-IS) platform [36]
with 1.8 Giga-pixel video sensors generating data at 27 Gigabits per second. Such systems can
quickly overwhelm the ability of C2 systems to process the information. As a result of this
increase,  many  intelligence,  surveillance,  and  reconnaissance  (ISR)  decision  support  systems  are 
receiving  large  volumes  of  data  with  poor  control  of  data  quality  (e.g.,  noise,  clutter).  Requests 
requiring adaptable analysis methods and unpredictable data requirements are normal. Tracking
vehicles in an urban environment or identifying the placement of roadside bombs
from video are typical examples of particularly challenging requests. There is also the problem
of  short-  and  long-term  storage,  data  accessibility,  and  supplying  the  computational  power  to 
process  the  requests.  In  this  context,  timeliness  becomes  a  critical  property.  If  the  raw  or 
processed  data  is  not  available  to  track  a  target,  this  data  quickly  becomes  of  limited  value. 
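
To give a sense of scale, the short calculation below converts a sustained sensor data rate into a daily storage volume. It is a back-of-the-envelope sketch that assumes the sensor runs continuously at its quoted rate, that no compression or on-board filtering is applied, and that decimal units are used (1 Gigabit = 10^9 bits, 1 terabyte = 10^12 bytes). The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day(gigabits_per_second: float) -> float:
    """Bytes produced in 24 hours at a sustained rate given in Gbit/s."""
    bits_per_second = gigabits_per_second * 1e9
    return bits_per_second * SECONDS_PER_DAY / 8


# Daily volume, in terabytes, for a sensor generating 27 Gbit/s.
terabytes_per_day = bytes_per_day(27) / 1e12
```

Under these assumptions, a single 27 Gigabit-per-second sensor produces roughly 290 terabytes per day, which illustrates why storage, accessibility, and processing capacity become limiting factors.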

There  are  a  wide  variety  of  scientific  and  technical  challenges  relating  to  handling  large  volumes 
of  data.  These  challenges  include  novel  architectures  for  storing  and  accessing  large  data  sets, 
processing  architectures  to  analyze  the  data,  and  methods  for  securely  sharing  data  and  results. 
Management  of  large  data  sets,  including  multi-level  classifications,  is  a  challenge.  Various 
research programs are being formulated to address many of the scientific challenges in these
types  of  issues.  At  least  26  research  projects  related  to  commander’s  decision  support  systems 
were  identified  in  2009  [37].  Other  newer  projects,  such  as  the  Data-to-Decisions  project,  are 
focused  on  the  issue  of  handling  the  volume  of  data.  Some  analysts  have  suggested  that  greater 
emphasis  should  be  put  on  assisting  users  to  understand  information  rather  than  designing  for  full 
automation.  However,  in  either  case,  additional  emphasis  should  be  given  to  understanding  data 
quality  and  incorporating  this  understanding  into  the  decision  processes. 


4.3  Trustworthiness  of  Data 


For  C2  systems,  determining  the  level  of  trust  to  place  in  data  can  be  extremely  important.  It  is 
often difficult to determine whether separate reports are referring to the same or separate incidents.
Data ringing, where the same report is relayed by different individuals, is a serious concern.
Similarly, copy-paste is frequently used in report generation, and automated tracking of
sources from copy-paste operations would be useful in determining trust. Incorporating some
form of provenance data is needed to help clarify these situations. Within the DoD, the Services
are currently focused on defining Authoritative Data Sources (ADSs) and using a standardized
metadata registry for data discovery and use. These systems have limited provenance data, primarily
containing only the source and date. Outside of the authoritative sources, there is almost
no provenance tracking. According to the emerging research on provenance, provenance data should
contain all the information necessary to determine the complete history of the data. For certain
applications, such as bioinformatics or physics, it is appropriate to capture the entire workflow
that transformed the data from input to output for purposes of validation or repeatability [38, 39].
Other applications mainly require documentation of original sources, context, or other relevant
pieces of information. In Reference [40], a W7 model (What, Who, When, Where, Which, How,
and Why) captures relevant information that would contain full documentation of a data life
cycle, from creation to destruction. There are many research activities working toward automating
the capture of provenance data with techniques for specialized architectures such as databases,
grid computing systems, file systems, service-oriented architectures, enterprise service
bus  (ESB),  and  archiving  systems.  For  resource-limited  environments,  such  as  those  often  faced 
by  C2  systems,  there  are  limits  to  the  amount  of  provenance  that  can  be  stored  or  transmitted.  To 
resolve  source  attribution,  automated  tracking  of  sources  from  copy-paste  operations  would  be 
useful.  Further  research  is  needed  to  characterize  the  utility  of  provenance  models  for  the  various 
C2  scenarios. 
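
The role of provenance in detecting data ringing can be illustrated with a small Python sketch. It is not the W7 model of Reference [40]: only the seven element names come from that model, while the record layout, the class and function names, and the rule that two reports share an origin when their who, when, and where elements match are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Provenance:
    """Seven provenance elements named after the W7 model (illustrative layout)."""
    what: str
    who: str
    when: str
    where: str
    which: str
    how: str
    why: str


@dataclass(frozen=True)
class Report:
    report_id: str
    relayed_by: str
    origin: Provenance


def independent_sources(reports: Iterable[Report]) -> int:
    """Count distinct origins among reports that may have been relayed.

    Assumption: reports with the same originator, time, and place describe
    the same observation, no matter who relayed them.
    """
    return len({(r.origin.who, r.origin.when, r.origin.where) for r in reports})
```

If three reports arrive through three relays but two of them carry the same origin, `independent_sources` returns 2, so the apparent corroboration is discounted. In resource-limited settings only a subset of the seven elements might be transmitted, which would weaken this check.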

5  Conclusions 

We  have  described  the  general  characteristics  of  data  quality  and  given  several  examples  of  how 
these characteristics are found in C2 systems. It is clear that C2 systems are beset with data quality
issues that are similar to those found in the general enterprise IT community. All of the data
quality characteristics are relevant to C2 systems. However, several quality issues are of relatively
greater importance to C2 because of the potential lethality of decision-making errors. It is
also  evident  that  these  characteristics  are  not  independent  and  that  they  should  not  be  addressed 
in  isolation.  They  should  be  part  of  an  ongoing  data  quality  enhancing  process.  Incorporation  of 
current  developments,  such  as  ISO  8000  standardization  tailored  for  C2  applications,  should  be 
considered.  Data  sharing  and  interoperability  are  largely  being  addressed  by  the  C2  community. 
However,  there  is  a  great  and  difficult  challenge  in  further  automating  interoperability  between 
C2  systems.  Some  of  the  tenets,  such  as  “publish  first,”  may  need  to  be  rethought  in  terms  of 
data quality.

It is also recommended that data quality characteristics and their associated metrics be explicitly
incorporated into C2 system operations. Research and development (R&D) is needed to determine
how best to accomplish this in a disruption-tolerant and robust fashion within the constraints
of real-time decisions, limited bandwidth and processing power, and intermittent service.
The  benefits  should  include  improved  decision  making  by  making  explicit  use  of  the  data  quality 
features,  such  as  believability,  provenance,  or  reputation. 

There  is  a  great  deal  of  relevant  research  on  general  aspects  of  data  quality,  with  many  issues  still 
unresolved.  For  instance,  there  is  active  research  on  automated  provenance  handling,  but  it 
remains a challenging problem. It is very difficult to determine whether a document has been
copied  or  combined,  unless  it  has  been  under  version  control  for  its  entire  existence.  There  has 
been  little  reported  research  specifically  on  C2  data  quality.  It  may  be  beneficial  to  consider  the 
various types of C2 data when considering how to capture C2 quality features. The quality features
of raw data may be very different from those of a command message or a situation report. One
approach to incorporating data quality capabilities is to provide appropriate metadata with every
data item so that the data becomes self-describing and self-protecting. This would have to be part
of a tradeoff when resources are limited, based on the benefit provided by the data quality information.
A logical next step is to conduct further investigation of specific C2 systems for data
quality  characterization  to  discover  the  tradeoffs  and  to  develop  specific  metrics  appropriate  to 
C2  contexts. 
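
One way to picture a self-describing data item, and the tradeoff under limited resources, is the sketch below. It is purely illustrative: the `DataItem` layout, the `trim_metadata` helper, and the idea of ranking quality features by a caller-supplied priority list are assumptions, not a design taken from any C2 system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataItem:
    """A payload that carries its own quality metadata (scores in [0, 1])."""
    payload: bytes
    quality: Dict[str, float]


def trim_metadata(item: DataItem, priority: List[str], budget: int) -> DataItem:
    """Keep at most `budget` quality features, highest priority first.

    The priority order stands in for "benefit provided by the data quality
    information"; how to derive it is left open here.
    """
    kept: Dict[str, float] = {}
    for name in priority:
        if len(kept) >= budget:
            break
        if name in item.quality:
            kept[name] = item.quality[name]
    return DataItem(item.payload, kept)
```

When bandwidth is constrained, a sender could call `trim_metadata` before transmission so that only the most valuable quality features accompany the data.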

Current practices in C2 involve a human-in-the-loop for almost all levels of data entry and analysis.
The increase in data volume is overwhelming and is causing information overload. Increased
use of machine processing of the raw data and elementary data is necessary if modern commanders
are to operate effectively under this data deluge. The commanders must be involved at
the crucial decision points and provided with situation awareness but must otherwise not be
encumbered by the lower level data details. Incorporation of data quality characteristics, along
with other forms of metadata that are semantically defined and can be processed and understood
by software, may go far in providing this environment. Semantic characterizations that incorporate
metadata as data in the knowledge base and can be accessed, manipulated, and used in inferences
are an alternative to more traditionally structured relational databases. These environments
can naturally incorporate quality features and use them to assist the decision maker in understanding
the credibility of the information relied upon.